Abstract: Prognostics has taken center stage in Condition 
Based Maintenance (CBM) where it is desired to estimate 
Remaining Useful Life (RUL) of the system so that 
remedial measures may be taken in advance to avoid 
catastrophic events or unwanted downtimes. Validation of 
such predictions is an important but difficult proposition 
and a lack of appropriate evaluation methods renders 
prognostics meaningless. Evaluation methods currently used 
in the research community are not standardized and in many 
cases do not sufficiently assess key performance aspects 
expected out of a prognostics algorithm. In this paper we 
introduce several new evaluation metrics tailored for 
prognostics and show that they can effectively evaluate 
various algorithms as compared to other conventional 
metrics. Specifically, four algorithms, namely Relevance 
Vector Machine (RVM), Gaussian Process Regression 
(GPR), Artificial Neural Network (ANN), and Polynomial 
Regression (PR), are compared. These algorithms vary in 
complexity and in their ability to manage uncertainty around 
predicted estimates. Results show that the new metrics rank 
these algorithms differently, and that suitable metrics may be 
chosen depending on the requirements and constraints. 
Beyond these results, these metrics offer ideas 
about how metrics suitable to prognostics may be designed 
so that the evaluation procedure can be standardized. 

Table of Contents 


1. Introduction 
2. Motivation 
3. Previous Work 
4. Application Domain 
5. Algorithms Evaluated 
6. Performance Metrics 
7. Results & Discussion 
8. Conclusion 
References 
Biographies 


1. Introduction 

Prognostics is an emerging concept in condition-based 
maintenance (CBM) of critical systems. Along with 
developing the fundamentals needed to confidently 
predict Remaining Useful Life (RUL), the technology calls 
for fielded applications as it inches towards maturation. 
This requires stringent performance evaluation so that the 
significance of the concept can be fully exploited. 
Currently, prognostics concepts lack standard definitions 
and suffer from ambiguous and inconsistent interpretations. 
This lack of standards is due in part to the varied end-user 
requirements of different applications, the wide range of time 
scales involved, the available domain information, and domain 
dynamics, to name a few issues. As a result, the research 
community has used a variety of metrics chosen largely for 
convenience with respect to each group's requirements. 
Very little attention has been focused on establishing a 
common ground to compare different efforts. 

This paper builds upon previous work that surveyed metrics 
in use for prognostics in a variety of domains including 
medicine, nuclear, automotive, aerospace, and electronics 
[1]. The effort suggested a list of metrics to assess critical 
aspects of RUL predictions. This paper will show how such 
metrics can be used to assess the performance of a 
prognostics algorithm. Furthermore, it will assess whether 
these metrics capture the performance criteria they were 
designed for. The paper will focus on metrics that are 
specifically designed for prognostics beyond conventional 
metrics currently being used for diagnostics and other 
forecasting applications. These metrics in general address 
the issue of how well the RUL prediction estimates improve 
over time as more measurement data become available. A 
good prognostic algorithm should not only improve in RUL 
estimation but also ensure a reasonable prediction horizon 
and confidence levels on the predictions. 

Overall, the paper is expected to improve the general 
understanding of these metrics so that they can be 
further refined and accepted by the research community 
as standard metrics for the performance assessment of 
prognostics algorithms. 


2. Motivation 

Prognostics technology is reaching a point where it must be 
evaluated in real-world environments in a truly integrated 
fashion. This, however, requires rigorous testing and 
evaluation on a variety of performance measures before 
prognostic systems can be certified for critical systems. For end-of-life 
predictions of critical systems, it becomes imperative to 
establish a fair amount of faith in the prognostic systems 
before incorporating their predictions into the decision- 
making process. Furthermore, performance metrics help 
establish design requirements that must be met. In the 
absence of standardized metrics it has been difficult to 
quantify acceptable performance limits and specify crisp 
and unambiguous requirements to the designers. 
Performance evaluation allows comparing different 
algorithms and also yields constructive feedback to further 
improve these algorithms. 

Performance evaluation is usually the foremost step once a 
new technique is developed. In many cases benchmark 
datasets or models are used to evaluate such techniques on a 
common ground so they can be fairly compared. Prognostic 
systems, in most cases, have neither of these options. 
Different researchers have used different metrics to evaluate 
their algorithms, which makes it rather difficult to compare 
various algorithms even if they have been declared 
successful based on their respective evaluations. It is 
accepted that prognostics methods must be tailored for 
specific applications, which makes it difficult to develop a 
generic algorithm useful for every situation. In such cases 
customized metrics may be used but there are characteristics 
of prognostics applications that remain unchanged and 
corresponding performance evaluation can lay a common 
ground for comparisons. So far very little has been done to 
identify a common ground when it comes to testing and 
comparing different algorithms. In two surveys of methods 
for prognostics (one of data-driven methods and one of 
artificial-intelligence-based methods) [2, 3], it can be seen 
that there is a lack of standardized methodology for 
performance evaluation and in many cases performance 
evaluation is not even formally addressed. Even the ISO 
standard [4] for prognostics in condition monitoring and 
diagnostics of machines lacks a firm definition of such 
metrics. Therefore, in this paper we present several new 
metrics and show how they can be effectively used to 
compare different algorithms. With these ideas we hope to 
provide some starting points for future discussions. 


3. Previous Work 

In a recent effort a thorough survey of various application 
domains that employ prediction related tasks was conducted 
[1]. The central idea, there, was to identify established 
methods of performance evaluation in the domains that can 
be considered mature and already have fielded applications. 
Specifically, domains like medicine, weather, nuclear, 
finance and economics, automotive, aerospace, electronics, 
etc. were considered. The survey revealed that although 
each domain employs a variety of custom metrics, metrics 
based on accuracy and precision dominated the landscape. 
However, these metrics were often used in different 
contexts depending on the type of data available and the 
kind of information derived out of them. This suggests that 
one must interpret the usage very carefully before 
borrowing any concepts from other domains. A brief 
summary of the findings is presented here for reference. 

Domains like medicine and finance heavily utilize statistical 
measures. These domains benefit from the availability of large 
datasets under different conditions. Predictions in medicine 
are based on hypothesis testing methodologies, and metrics 
like accuracy, precision, interseparability, and resemblance 
are computed on test outcomes. In finance, statistical 
measures are computed on errors relative to 
reference prediction models. Metrics like MSE (mean 
squared error), MAD (mean absolute deviation), MdAD (median 
absolute deviation), MAPE (mean absolute percentage 
error), and their several variations are widely used. These 
metrics represent different ways of expressing accuracy and 
precision measures. The domain of weather prediction 
mainly uses two classes of evaluation methods: error-based 
statistics and measures of resolution between two outcomes. 
A related domain, windmill power prediction, uses the 
statistical measures already listed above. Other domains like 
aerospace, electronics, and nuclear are relatively immature 
as far as fielded prognostics applications are concerned. In 
addition to conventional accuracy and precision measures, a 
significant focus has been on metrics that assess business 
merits, like ROI (return on investment), TV (technical 
value), and life cycle cost, in addition to reliability-based metrics 
like MTBF (mean time between failure) or the ratio 
MTBF/MTBUR (MTBUR: mean time between unit replacements). 

Several classifications of these metrics, derived from the end 
use of the prognostics information, have been presented 
in [1]. It has been argued that, depending on the end 
user requirements, one must choose an appropriate set of these 
metrics or their variants to appropriately evaluate the 
performance of the algorithms. 

4. Application Domain 

In this section we describe the application domain we used 
to show how these new prognostics metrics may be applied 
and used to compare different algorithms. 

INL Battery Dataset 

In 1998 the Office of Vehicle Technologies at the U.S. 
Department of Energy initiated the Advanced Technology 
Development (ATD) program in order to find solutions to 
the barriers limiting commercialization of high-power 
Lithium-ion batteries for hybrid-electric and plug-in electric 
vehicles. Under this program, a set of second-generation 
18650-size Lithium-ion cells were cycle-life tested at the 
Idaho National Laboratory (INL). 

The cells were aged under different experimental settings 
like temperature, State-of-Charge (SOC), current load, etc. 
Regular characterization tests were performed to measure 
behavioral changes from the baseline under different aging 
conditions. The test matrix consisted of three SOCs (60, 80, 
and 100%), four temperatures (25, 35, 45, and 55°C), and 
three different life tests (calendar-life, cycle-life, and 
accelerated-life) [5]. Electrochemical Impedance Spectroscopy 
(EIS) measurements were recorded every four weeks to 
estimate battery health. EIS measurements were then used 
to extract internal resistance parameters ($R_E$ and $R_{CT}$, see 
Figure 1) that have been shown empirically to characterize 
ageing using a lumped parameter model for 
the Li-ion batteries [6]. 

Figure 1 - Internal parameter values are used as 
features extracted from EIS measurements to 
characterize battery health. 

As shown in Figure 2, battery capacity was also measured 
in ampere hours by measuring time and currents during 
discharge cycle for the batteries. For the data used in our 
study, the cells were aged at 60% SOC and at temperatures 
of 25°C and 45°C. The 25°C data is used solely for training 
while the 45°C data is used for both training as well as 
testing. 

Different approaches can be taken to predict battery life 
based on the above measurements. One approach makes use of 
EIS measurements to compute $R_E + R_{CT}$ and then uses 
prediction algorithms to predict the evolution of this 
parameter. $R_E + R_{CT}$ has been shown to be directly 
connected to battery capacity, and hence its evolution 
curve can be easily transformed into battery RUL. Another 
approach directly tracks battery capacity and trends it to 
come up with RUL estimates. In the next sections we 
describe our prediction algorithms and the corresponding 
approaches that were used to estimate battery life. 


Figure 2 - Battery capacity decay profile at 45°C 
(plot title: Battery Capacity Decay with Time; x-axis: time in weeks). 


5. Algorithms Evaluated 

In this effort we chose four data-driven algorithms to show 
the effectiveness of various metrics in evaluating the 
performance. These algorithms range from simple 
polynomial regression to sophisticated Bayesian learning 
methods. The approaches used here have been described 
before in [6, 7], but they are repeated here for the sake of 
completeness and readability. Also mentioned briefly is the 
procedure by which each of these algorithms was applied to the 
battery health management dataset. 

Polynomial Regression (PR) Approach 

We employed a simple data-driven routine to establish a 
baseline for battery health prediction performance and 
uncertainty assessment. As the first step, the equivalent 
damage threshold in $R_E + R_{CT}$ ($d_{th} = 0.033$) is 
gleaned from the relationship between $R_E + R_{CT}$ and 
the capacity C at baseline temperature 
(25°C). Next, via features extracted from the EIS 
measurements, $R_E + R_{CT}$ was tracked at elevated 
temperature (here 45°C). Ignoring the first two data points 
(which behave similarly to what is considered a "wear-in" 
pattern in other domains), a second-degree polynomial was 
fitted at the prediction points and extrapolated out to the 
damage threshold. This extrapolation is then used to 
compute the expected RUL values. 
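
The fit-and-extrapolate step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `rul_from_quadratic_fit` and the sample measurements are hypothetical, and only the threshold value 0.033 comes from the text.

```python
import numpy as np

def rul_from_quadratic_fit(t, d, d_th):
    """Fit d(t) with a second-degree polynomial and return the time left
    until the fit first crosses d_th after the last observation.
    Returns None if the fitted curve never reaches the threshold."""
    t = np.asarray(t, dtype=float)
    d = np.asarray(d, dtype=float)
    a, b, c = np.polyfit(t, d, deg=2)        # d(t) ~ a*t^2 + b*t + c
    roots = np.roots([a, b, c - d_th])       # times where the fit equals d_th
    real_roots = roots[np.isreal(roots)].real
    future = real_roots[real_roots > t[-1]]  # keep crossings after the last sample
    if future.size == 0:
        return None
    return float(future.min() - t[-1])

# Hypothetical R_E + R_CT readings every four weeks (first two points dropped).
weeks = np.arange(8, 36, 4.0)
damage = 0.015 + 1e-4 * weeks + 2e-6 * weeks**2
rul_weeks = rul_from_quadratic_fit(weeks, damage, d_th=0.033)
```

Repeating the call each time a new measurement arrives gives one RUL estimate per prediction point, which is the sequence the metrics in Section 6 operate on.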

Relevance Vector Machines (RVM) 

The Relevance Vector Machine (RVM) [8] is a Bayesian 
form representing a generalized linear model of identical 
functional form of the Support Vector Machine (SVM) [9]. 
Although SVM is a state-of-the-art technique for 
classification and regression, it suffers from a number of 
disadvantages, one of which is the lack of probabilistic 
outputs that make more sense in health monitoring 
applications. The RVM attempts to address these very 
issues in a Bayesian framework. Besides the probabilistic 
interpretation of its output, it uses far fewer kernel 
functions for comparable generalization performance. 

This type of supervised machine learning starts with a set of 
input vectors $\{x_n\}$, $n = 1, \ldots, N$, and their corresponding 
targets $\{t_n\}$. The aim is to learn a model of the dependency 
of the targets on the inputs in order to make accurate 
predictions of t for unseen values of x. Typically, the 
predictions are based on some function F(x) defined over 
the input space, and learning is the process of inferring the 
parameters of this function. The targets are assumed to be 
samples from the model with additive noise: 

$$ t_n = F(x_n; \mathbf{w}) + \varepsilon_n \qquad (1) $$

where $\varepsilon_n$ are independent samples from some noise process 
(Gaussian with mean 0 and variance $\sigma^2$). Assuming the 
independence of $t_n$, the likelihood of the complete data set 
can be written as: 

$$ p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \lVert \mathbf{t} - \Phi\mathbf{w} \rVert^2 \right\} \qquad (2) $$

where $\mathbf{w} = (w_1, w_2, \ldots, w_M)^T$ is a weight vector and $\Phi$ is the 
$N \times (N+1)$ design matrix with $\Phi = [\varphi(x_1), \varphi(x_2), \ldots, \varphi(x_N)]^T$, 
in which $\varphi(x_n) = [1, K(x_n, x_1), K(x_n, x_2), \ldots, K(x_n, x_N)]^T$, 
$K(x_n, x_i)$ being a kernel function. 

To prevent over-fitting, a preference for smoother functions 
is encoded by choosing a zero-mean Gaussian prior 
distribution over $\mathbf{w}$ parameterized by the hyperparameter 
vector $\eta$. To complete the specification of this hierarchical 
prior, the hyperpriors over $\eta$ and the noise variance $\sigma^2$ are 
approximated as delta functions at their most probable 
values $\eta_{MP}$ and $\sigma^2_{MP}$. Predictions for new data are then made 
according to: 

$$ p(t_* \mid \mathbf{t}) = \int p(t_* \mid \mathbf{w}, \sigma^2_{MP}) \, p(\mathbf{w} \mid \mathbf{t}, \eta_{MP}, \sigma^2_{MP}) \, d\mathbf{w} \qquad (3) $$


Gaussian Process Regression (GPR) 

Gaussian Process Regression (GPR) is a probabilistic 
technique for nonlinear regression that computes posterior 
degradation estimates by constraining the prior distribution 
to fit the available training data [10]. A Gaussian Process 
(GP) is a collection of random variables, any finite number 
of which have a joint Gaussian distribution. A real GP $f(x)$ 
is completely specified by its mean function $m(x)$ and 
covariance function $k(x, x')$, defined as: 

$$ m(x) = E[f(x)], $$
$$ k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))], \text{ and} \qquad (4) $$
$$ f(x) \sim GP(m(x), k(x, x')). $$

The index set $X \in \mathbb{R}$ is the set of possible inputs, which 
need not necessarily be a time vector. Given prior 
information about the GP and a set of training points 
$\{(x_i, f_i) \mid i = 1, \ldots, n\}$, the posterior distribution over functions is 
derived by imposing a restriction on the prior joint distribution 
to contain only those functions that agree with the observed 
data points. These functions can be assumed to be noisy, as 
in real-world situations we have access only to noisy 
observations rather than exact function values, i.e. $y_i = f(x_i) + \varepsilon$, 
where $\varepsilon$ is additive IID $N(0, \sigma_n^2)$ noise. Once we have a 
posterior distribution it can be used to assess predictive 
values for the test data points. The following equations describe 
the predictive distribution for GPR [11]. 

Posterior: 

$$ f_{test} \mid X, y, X_{test} \sim N(\bar{f}_{test}, \mathrm{cov}(f_{test})), \text{ where} $$
$$ \bar{f}_{test} \equiv E[f_{test} \mid X, y, X_{test}] = K(X_{test}, X)[K(X, X) + \sigma_n^2 I]^{-1} y, \qquad (6) $$
$$ \mathrm{cov}(f_{test}) = K(X_{test}, X_{test}) - K(X_{test}, X)[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_{test}). $$

A crucial ingredient in a Gaussian process predictor is the 
covariance function ($K(X, X')$) that encodes the assumptions 
about the functions to be learnt by defining the relationship 
between data points. GPR requires prior knowledge about 
the form of the covariance function, which must be derived 
from the context if possible. Furthermore, covariance 
functions consist of various hyper-parameters that define 
their properties. Setting the right values of such 
hyper-parameters is yet another challenge in learning the desired 
functions. Although the choice of covariance function must 
be specified by the user, the corresponding hyper-parameters 
can be learned from the training data using a gradient-based 
optimizer, for example by maximizing the marginal likelihood of the 
observed data with respect to the hyper-parameters [12]. 


We used GPR to regress the evolution of the internal 
parameters ($R_E + R_{CT}$) of the battery with time at 45°C. 
The relationship between these parameters and the battery 
capacity was learned from experimental data at 25°C [7]. Thus 
the internal parameters were regressed for the data obtained 
at 45°C and the corresponding estimates were translated into 
estimated battery capacity at 45°C using the relationship 
learnt at 25°C. 
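
The GPR posterior mean and covariance given above can be written directly in NumPy. This is a minimal sketch under assumptions that are not stated in the paper: a zero prior mean, a squared-exponential covariance function with hand-picked hyper-parameters, and hypothetical function names (`sq_exp_kernel`, `gp_posterior`).

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b
    (an assumed kernel choice; the paper does not name its kernel)."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    return signal_var * np.exp(-0.5 * (a - b.T) ** 2 / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var, **kernel_args):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    K = sq_exp_kernel(x_train, x_train, **kernel_args)    # K(X, X)
    K_s = sq_exp_kernel(x_test, x_train, **kernel_args)   # K(X_test, X)
    K_ss = sq_exp_kernel(x_test, x_test, **kernel_args)   # K(X_test, X_test)
    A = K + noise_var * np.eye(len(K))                    # K(X, X) + sigma_n^2 I
    y = np.asarray(y_train, dtype=float)
    mean = K_s @ np.linalg.solve(A, y)
    cov = K_ss - K_s @ np.linalg.solve(A, K_s.T)
    return mean, cov
```

The diagonal of the returned covariance gives the predictive variance at each test point, which is what allows confidence bounds to be attached to the regressed parameter trajectory. Hyper-parameter learning by marginal-likelihood maximization is omitted here.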

Neural Network (NN) Approach 

A neural-network-based approach was considered as an 
alternative data-driven approach for prognostics. A basic 
feed-forward neural network with back-propagation training 
was used; details on this algorithm can be found in [13, 14]. 
As described earlier for the other approaches, data at 25°C was 
used to learn the relationship between the internal parameter 
$R_E + R_{CT}$ and the capacity C using the neural network $NN_1$. In 
addition, the 45°C data was used as a test case. Here, 
measurements of the internal parameter $R_E + R_{CT}$ are only 
available up to time $t_P$ (the time at which the RUL prediction is 
made). The available $R_E + R_{CT}$ measurements are 
extrapolated after time $t_P$ in order to predict future values. 
This extrapolation is done using neural network $NN_2$, which 
learns the relationship between $R_E + R_{CT}$ and time. Once 
future values for $R_E + R_{CT}$ are computed using $NN_2$, these 
values are then used as an input to $NN_1$ in order to 
obtain C. 

The structure of $NN_1$ consists of two hidden layers with one 
and three nodes respectively. Tan-sigmoid transfer functions 
were chosen for the hidden layers and log-sigmoid transfer 
functions for the output layers. Training considers 
random initial weights, a reduced-memory 
Levenberg-Marquardt algorithm, 200 training epochs, and 
mean-squared error as a performance parameter. 

The structure and training parameters of $NN_2$ remained 
fixed during the forecasting. The net was trained with data 
available up to week 32, and then the resulting model was 
used to extrapolate $R_E + R_{CT}$ until $t_{EOP}$ is reached, or until it is clear 
that it will not be reached because the model does not converge. 
Once the next measurement point became available at week 36, 
the net was trained again including the new data point. The 
resulting model was used to extrapolate $R_E + R_{CT}$ from 
$t_{P+1} = 36$ onwards. It is not expected that a fixed net structure 
and fixed training settings could perform optimally for all 
the training instances as measurements become available from 
week 32 onwards. To make sure the results are acceptable 
for all the training instances, the initial weights were set to 
random values and the training was repeated 30 times. This 
allowed different initial values to be explored in the 
optimization of the weights, and hence different local minima. 
The results of the 30 training 
cases were aggregated on the extrapolated values by 
computing the median. Cases were observed where the 
training stopped prematurely, resulting in a net with poor 
performance. These cases were regarded as outliers, and the 
use of the median was intended to diminish the impact of 
such outliers while aggregating all the training cases. The 
structure of $NN_2$ consists of one hidden layer with three 
nodes, and tan-sigmoid transfer functions for all the layers. 
Training considers random initial weights, a reduced-memory 
Levenberg-Marquardt algorithm, 200 training 
epochs, and mean-squared error as a performance 
parameter. 
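
The median aggregation over repeated training runs can be illustrated with a short sketch. The function name `aggregate_runs` and the numbers are hypothetical; the network training itself is not reproduced here.

```python
import numpy as np

def aggregate_runs(extrapolations):
    """Aggregate repeated training runs point by point.
    `extrapolations` has one row per run and one column per future time step."""
    runs = np.asarray(extrapolations, dtype=float)
    return np.median(runs, axis=0)

# Hypothetical extrapolated values from five runs; the fourth run stopped
# prematurely and produced values far from the others.
runs = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [50.0, 60.0, 70.0],
    [1.0, 2.0, 3.0],
]
median_curve = aggregate_runs(runs)      # barely affected by the failed run
mean_curve = np.mean(runs, axis=0)       # pulled strongly towards the failed run
```

Comparing `median_curve` with `mean_curve` shows why the median is the safer aggregate when a few runs fail: a single failed run shifts the mean substantially but leaves the median close to the well-behaved runs.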

6. Performance Metrics 

In this section nine different performance metrics are 
described. Four of them are the metrics most widely used in 
the community, i.e., accuracy, precision, Mean Squared 
Error (MSE), and Mean Absolute Percentage Error 
(MAPE). These metrics are included to illustrate that they 
are useful but may not 
capture the time-varying aspects of prognostic estimates. 
Further, five new metrics are introduced that 
capture such features of interest. These metrics are 
first defined briefly and then evaluated based on the 
results for battery health management presented in the 
following section. 

Terms and Notations 

• UUT is the unit under test. 

• $\Delta^l(i)$ is the error between the predicted and the true RUL 
at time index i for UUT l. 

• EOP (End-of-Prediction) is the earliest time index, i, 
after prediction crosses the failure threshold. 

• EOL represents End-of-Life, the time index for actual 
end of life defined by the failure threshold. 

• P is the time index at which the first prediction is made 
by the prognostic system. 

• $r^l(i)$ is the RUL estimate at time $t_i$ given that data is 
available up to time $t_i$ for the l-th UUT. 

• $\ell$ is the cardinality of the set of all time indices at which 
the predictions are made, i.e. $\ell = |\{ i \mid P \le i \le EOP \}|$. 

Average Bias (Accuracy) 

Average bias is one of the conventional metrics that has 
been used in many ways as a measure of accuracy. It 
averages the errors in predictions made at all subsequent 
times after prediction starts for the l-th UUT. This metric can 
be extended to average biases over all UUTs to establish 
overall bias. 

$$ \mathrm{Bias}_l = \frac{1}{\ell} \sum_{i=1}^{\ell} \Delta^l(i) \qquad (7) $$


Sample Standard Deviation (Precision) 

Sample standard deviation measures the dispersion/spread 
of the error with respect to the sample mean of the error. 
This metric is restricted to the assumption of a normal 
distribution of the error. It is, therefore, recommended to 
carry out a visual inspection of the error plots to determine 
the distribution characteristics before interpreting this 
metric. 

$$ \mathrm{SSD} = \sqrt{ \frac{\sum_{i=1}^{\ell} (\Delta(i) - m)^2}{\ell - 1} } \qquad (8) $$

where m is the sample mean of the error. 

Mean Squared Error (MSE) 

The simple average bias metric suffers from the fact that 
negative and positive errors cancel each other and high 
variance may not be reflected in the metric. Therefore, MSE 
averages the squared prediction error over all predictions and 
encapsulates both accuracy and precision. A derivative of 
MSE, often used, is Root Mean Squared Error (RMSE). 

$$ \mathrm{MSE} = \frac{1}{\ell} \sum_{i=1}^{\ell} \Delta(i)^2 \qquad (9) $$

Mean Absolute Percentage Error (MAPE) 

For prediction applications it is important to differentiate 
between errors observed far away from the EOL and those 
observed close to EOL. Smaller errors are desirable as 
EOL approaches. Therefore, MAPE weighs errors with 
RULs and averages the absolute percentage errors in the 
multiple predictions. Instead of the mean, the median can be 
used to compute the Median Absolute Percentage Error 
(MdAPE) in a similar fashion. 

$$ \mathrm{MAPE} = \frac{1}{\ell} \sum_{i=1}^{\ell} \left| \frac{100\,\Delta(i)}{r_*(i)} \right| \qquad (10) $$

where $r_*(i)$ is the true RUL at time index i. 

It must be noted that the above metrics can be more suitably 
used in cases where either a distribution of RUL predictions 
is available as the algorithm output or there are multiple 
predictions available from several UUTs to compute the 
statistics. Whereas these metrics can convey meaningful 
information in these cases, they are not designed 
for applications where RULs are continuously updated as 
more data becomes available. It is desirable to have metrics that 
can characterize improvement in the performance of a 
prognostic algorithm as time approaches end-of-life. In 
this paper we discuss one such application, where algorithms 
are tracking battery health, and show how newer metrics can 
encapsulate such information, which is valuable for 
successful fielded application of prognostics. Therefore, 
next we discuss new metrics tailored for prognostics and 
show how they are more informative than the ones 
traditionally used. 


Prognostic Horizon (PH) 

Prediction horizon has been discussed in the literature for 
quite some time, but no formal definition is available. The notion 
suggests that the longer the prognostic horizon, the more time is 
available to act on a prediction that has some 
credibility. We define Prognostic Horizon as the difference 
between the current time index i and EOP, utilizing data 
accumulated up to time index i, provided the prediction 
meets desired specifications. This specification may be 
given in terms of an allowable error bound ($\alpha$) around the true 
EOL. This metric ensures that the predicted estimates are 
within specified limits around the actual EOL and hence the 
predictions may be considered trustworthy. It is expected 
that PHs are determined for an algorithm-application pair 
offline during the validation phase and that these numbers 
are then used as guidelines when the algorithm is deployed in a test 
application where the actual EOL is not known in advance. 
When comparing algorithms, an algorithm with a longer 
prediction horizon would be preferred. 

$$ H = EOP - i \qquad (11) $$

where $i = \min\{ j \mid (P \le j \le EOP) \wedge (r_*(1-\alpha) \le r^l(j) \le r_*(1+\alpha)) \}$. 

For instance, a PH with an error bound of $\alpha$ = 5% identifies 
when a given algorithm starts predicting estimates that are 
within 5% of the actual EOL. Other specifications may be 
used to derive PH as desired. 
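
The four conventional metrics and the prognostic horizon defined above can be sketched as follows. Several choices here are assumptions rather than statements from the paper: the error is taken as predicted minus true RUL, the error bound is applied as a fraction of the true RUL at each prediction time, and the function names and sample numbers are hypothetical.

```python
import numpy as np

def conventional_metrics(r_pred, r_true):
    """Bias, sample standard deviation, MSE, and MAPE of RUL predictions."""
    r_pred = np.asarray(r_pred, dtype=float)
    r_true = np.asarray(r_true, dtype=float)
    err = r_pred - r_true                     # assumed sign: predicted minus true
    return {
        "bias": float(err.mean()),
        "ssd": float(err.std(ddof=1)),        # divides by (l - 1)
        "mse": float(np.mean(err**2)),
        "mape": float(np.mean(np.abs(100.0 * err / r_true))),
    }

def prognostic_horizon(times, r_pred, r_true, alpha, eop):
    """EOP minus the first prediction time whose RUL estimate lies within
    +/- alpha of the true RUL. Returns None if no prediction qualifies."""
    times = np.asarray(times, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    r_true = np.asarray(r_true, dtype=float)
    inside = np.abs(r_pred - r_true) <= alpha * r_true
    if not inside.any():
        return None
    first_time = times[np.argmax(inside)]     # argmax finds the first True
    return float(eop - first_time)

# Hypothetical predictions made every four weeks with a true EOL at week 64.
times = [32, 36, 40, 44, 48]
r_true = [32, 28, 24, 20, 16]
r_pred = [40, 33, 25, 20.5, 15]
metrics = conventional_metrics(r_pred, r_true)
ph = prognostic_horizon(times, r_pred, r_true, alpha=0.05, eop=64)
```

In this toy case the aggregate metrics are dominated by the two early, poor predictions, while the prognostic horizon only records when the estimates first become acceptable. This is the distinction the section draws between the conventional and the new metrics.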

α-λ Accuracy 

Another way to quantify prediction quality may be through 
a metric that determines whether the prediction falls within 
specified accuracy levels at a specific time. These 
time instances may be specified as a percentage of the total 
remaining life from the point the first prediction is made, or as 
a given absolute time interval before EOL is reached. In our 
implementation we define α-λ accuracy as the prediction 
accuracy being within α·100% of the actual RUL at a specific 
time instance $t_\lambda$, expressed as a fraction of the time between the 
point when an algorithm starts predicting and the actual 
failure. For example, it determines whether a prediction 
falls within 20% accuracy (i.e., α = 0.2) halfway to failure 
from the time the first prediction is made (i.e., λ = 0.5). 

$$ [1 - \alpha] \cdot r_*(t_\lambda) \le r^l(t_\lambda) \le [1 + \alpha] \cdot r_*(t_\lambda) \qquad (12) $$

where α: accuracy modifier 

λ: time window modifier 

$t_\lambda = P + \lambda(EOL - P)$. 


Figure 3 - Schematic depicting α-λ Accuracy. 

Relative Accuracy (RA) 

Relative prediction accuracy is a notion similar to α-λ 
accuracy where, instead of finding out whether the 
predictions fall within a given accuracy level at a given 
time instant, we measure the accuracy level itself. The time instant 
is again described as a fraction of the actual remaining useful 
life from the point when the first prediction is made. An 
algorithm with higher relative accuracy is desirable. 

$$ RA_\lambda = 1 - \frac{\left| r_*(t_\lambda) - r^l(t_\lambda) \right|}{r_*(t_\lambda)} \qquad (13) $$

where $t_\lambda = P + \lambda(EOL - P)$. 



Figure 4 - Schematic showing Relative Accuracy 
concept. 
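
The α-λ accuracy test and relative accuracy at $t_\lambda$ can be sketched together. The sketch makes several assumptions that are not from the paper: the first prediction time P is taken as the first entry of `times`, the true RUL is taken as EOL minus the current time, the estimate at $t_\lambda$ is obtained by linear interpolation when $t_\lambda$ is not itself a prediction time, and the function name is hypothetical.

```python
import numpy as np

def evaluate_at_lambda(times, r_pred, eol, lam, alpha):
    """Return (t_lambda, within_alpha_bounds, relative_accuracy)."""
    times = np.asarray(times, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    p = times[0]                                   # time of first prediction
    t_lam = p + lam * (eol - p)                    # t_lambda = P + lambda (EOL - P)
    r_true = eol - t_lam                           # assumed true RUL at t_lambda
    r_hat = float(np.interp(t_lam, times, r_pred)) # estimate at t_lambda
    within = (1 - alpha) * r_true <= r_hat <= (1 + alpha) * r_true
    ra = 1.0 - abs(r_true - r_hat) / r_true
    return float(t_lam), bool(within), float(ra)

# Hypothetical predictions every four weeks, true EOL at week 64.
times = [32, 36, 40, 44, 48, 52, 56, 60]
r_pred = [40, 33, 25, 20.5, 15, 11, 8, 4]
t_lam, ok, ra = evaluate_at_lambda(times, r_pred, eol=64, lam=0.5, alpha=0.2)
```

With these numbers $t_\lambda$ falls at week 48, where the true RUL is 16 weeks and the estimate is 15 weeks, so the prediction passes the 20% test and the relative accuracy is 1 - 1/16.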


Cumulative Relative Accuracy (CRA) 

Relative accuracy can be evaluated at multiple time 
instances. To aggregate these accuracy levels we define 
Cumulative Relative Accuracy as a normalized weighted 
sum of relative prediction accuracies at specific time 
instances. 

$$ CRA_\lambda = \frac{1}{\ell} \sum_{i=1}^{\ell} w(r^l) \, RA_\lambda \qquad (14) $$

where w is a weight factor as a function of RUL at all time 
indices. In most cases it is desirable to weigh the relative 
accuracies higher closer to the EOL. 

Convergence 

Convergence is defined to quantify the manner in which any 
metric like accuracy or precision improves with time to 
reach its perfect score. As illustrated in Figure 5, three cases 
converge at different rates. It can be shown that the distance 
between the origin and the centroid of the area under the 
curve for a metric quantifies convergence. The lower the 
distance, the faster the convergence. Convergence is a useful 
metric since we expect a prognostics algorithm to converge 
to the true value as more information accumulates over time. 
Further, faster convergence is desired to achieve high 
confidence while keeping the prediction horizon as large as 
possible. 

Let $(x_c, y_c)$ be the center of mass of the area under the curve 
M(i). Then the convergence $C_M$ can be represented by the 
Euclidean distance between the center of mass and $(t_P, 0)$, 
where 

$$ C_M = \sqrt{(x_c - t_P)^2 + y_c^2}, $$

$$ x_c = \frac{\frac{1}{2} \sum_{i=P}^{EOP} (t_{i+1}^2 - t_i^2) \, M(i)}{\sum_{i=P}^{EOP} (t_{i+1} - t_i) \, M(i)}, \text{ and} $$

$$ y_c = \frac{\frac{1}{2} \sum_{i=P}^{EOP} (t_{i+1} - t_i) \, M(i)^2}{\sum_{i=P}^{EOP} (t_{i+1} - t_i) \, M(i)}. \qquad (15) $$

M(i) is a non-negative prediction error, accuracy, or precision 
metric. 


Figure 5 - Schematic for the convergence of a metric. 
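
Cumulative relative accuracy and the convergence distance can be sketched as below. This is an illustration under stated assumptions, not the authors' code: relative accuracy is evaluated at every prediction time, the weights default to one, and the metric M(i) is treated as constant on each interval between consecutive prediction times, so the value at the last time spans no interval and is dropped. Function names and sample values are hypothetical.

```python
import numpy as np

def cumulative_relative_accuracy(r_pred, r_true, weights=None):
    """(1/l) * sum of weighted relative accuracies over all prediction times."""
    r_pred = np.asarray(r_pred, dtype=float)
    r_true = np.asarray(r_true, dtype=float)
    ra = 1.0 - np.abs(r_true - r_pred) / r_true
    w = np.ones_like(ra) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * ra) / len(ra))

def convergence(times, metric):
    """Distance from (t_P, 0) to the centroid of the area under M(i)."""
    t = np.asarray(times, dtype=float)
    m = np.asarray(metric, dtype=float)[:-1]   # M(i) held on [t_i, t_{i+1}]
    dt = np.diff(t)
    area = np.sum(dt * m)
    if area == 0:
        raise ValueError("metric is zero everywhere; centroid is undefined")
    x_c = 0.5 * np.sum((t[1:] ** 2 - t[:-1] ** 2) * m) / area
    y_c = 0.5 * np.sum(dt * m**2) / area
    return float(np.hypot(x_c - t[0], y_c))

# Two hypothetical error curves starting from the same value.
t = [0, 1, 2, 3, 4]
fast = convergence(t, [4, 2, 1, 0.5, 0.25])    # error shrinks quickly
slow = convergence(t, [4, 3.5, 3, 2.5, 2])     # error shrinks slowly
```

The quickly shrinking error curve yields the smaller distance, matching the rule that a lower distance means faster convergence.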

7. Results & Discussion 

As mentioned earlier, battery health measurements were 
taken every four weeks. Therefore, each algorithm was 
tasked to predict every four weeks after week 32, which 
gives about eight data points to learn the degradation trend. 
Algorithms predict RULs until the end-of-prediction is 
reached, i.e. until the estimates show that battery capacity has 
already hit 70% of the full capacity of one ampere hour. 
The corresponding predictions are then evaluated using all nine 
metrics. Algorithms like RVM always predicted 
conservatively, i.e. predicted a faster degradation than 
actually observed. Its estimates were available for all weeks 
from week 32 through week 64. Other algorithms like 
NN and PR started predicting at week 32 but could not 
predict beyond week 60, as their estimates had already 
crossed the failure threshold before that. GPR, however, 
required more training data before it could provide any 
estimates. Therefore, predictions for GPR start at week 48 
and go until week 60. 


Table 1 - Performance evaluation for all four test 
algorithms with Error Bound = 5%. 

Metric           RVM      GPR      NN       PR
Bias            -7.12     5.96     5.04     1.87
SSD              6.57    15.24     6.81     4.26
MSE             84.81   184.16    59.49    17.35
MAPE            41.36    53.93    37.54    23.05
PH               8.46    12.46    12.46    24.46
RA (λ = 0.5)     0.60     0.86     0.34     0.82
CRA (λ = 0.5)    0.63     0.52     0.55     0.65
Convergence     14.80     8.85    13.36    11.41

Figure 6 - Predictions from different algorithms fall 
within the error bound at different times 
(plot title: Prediction Horizon (5% error)). 


In Table 1, results are aggregated over all available 
predictions. These results show that the polynomial fit 
approach outperforms all other algorithms on almost all 
metrics. Even though its convergence properties are not the 
best, they are comparable to the top numbers. However, 
using all predictions to compute these metrics results in a 
wide range of values, which makes it difficult to assess how 
the other algorithms fare even if they may not necessarily 
be the best. Most metrics describe how close or far the 
predictions are from the true value, but prediction horizon 
indicates when these predictions enter the specified error 
bound and therefore may be trustworthy (see Figure 6). PR 
enters the error bound early on, whereas all other 
algorithms converge slowly as time passes. The convergence 
metric encapsulates this attribute and shows that algorithms 
like GPR converge faster to better estimates and may be 
useful later on. We also learned that the current 
convergence metric does not take into account cases where 
algorithms start predicting at different time instances. In 
such cases, algorithms that start predicting early on may be 
at a disadvantage. Although this metric works well in most 
cases, a few adjustments may be needed to make it robust to 
extreme cases. 
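
To make the prediction horizon computation concrete, the 
following Python sketch finds the first prediction that 
falls within the error bound and reports the time remaining 
until EOL. It is an illustration only: the function name, 
the sample predictions, the EOL of 64 weeks, and the choice 
of a bound of plus or minus alpha times EOL around the true 
RUL are assumptions made for this example and are not taken 
from the experiments reported here. 

```python
from typing import Optional, Sequence


def prediction_horizon(
    times: Sequence[float],
    predicted_rul: Sequence[float],
    eol: float,
    alpha: float,
) -> Optional[float]:
    """Time between EOL and the first prediction that enters the error bound.

    Assumption: the bound is +/- alpha * eol around the true RUL (eol - t).
    Returns None if no prediction ever enters the bound.
    """
    bound = alpha * eol
    for t, rul_hat in zip(times, predicted_rul):
        true_rul = eol - t
        if abs(rul_hat - true_rul) <= bound:
            return eol - t
    return None


# Hypothetical predictions made every four weeks from week 32 onward.
times = [32, 36, 40, 44, 48, 52, 56, 60]
predicted = [20.0, 18.0, 16.0, 13.0, 11.0, 10.0, 7.0, 4.0]
eol = 64.0

ph_5 = prediction_horizon(times, predicted, eol, alpha=0.05)
ph_10 = prediction_horizon(times, predicted, eol, alpha=0.10)
print(ph_5, ph_10)
```

With these made-up numbers the predictions enter the 5% 
bound at week 52 and the wider 10% bound one step earlier, 
at week 48, so a more relaxed bound gives a longer horizon. 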

It must be pointed out that these metrics summarize all 
predictions, good or bad, into one aggregate, which may not 
be fair to algorithms that learn over time and get better 
later on. Therefore, we next evaluated only those 
predictions that were made within the prediction horizon, 
so that only the meaningful predictions are evaluated 
(Table 2). As expected, the results change significantly and 
the performance numbers become comparable across all 
algorithms. This provides a better understanding of how 
these algorithms compare. 
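
The effect of restricting the evaluation to the prediction 
horizon can be sketched in Python as follows. This is a 
minimal illustration under stated assumptions: the error is 
taken as predicted RUL minus true RUL (so conservative 
predictions give a negative bias), SSD is the sample 
standard deviation of that error, and the data, EOL and 
prediction horizon are hypothetical values rather than the 
battery results. 

```python
import math
from typing import Dict, Sequence


def conventional_metrics(
    true_rul: Sequence[float], predicted_rul: Sequence[float]
) -> Dict[str, float]:
    """Bias, sample standard deviation, MSE and MAPE of the RUL error."""
    errors = [p - t for p, t in zip(predicted_rul, true_rul)]
    n = len(errors)
    bias = sum(errors) / n
    ssd = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1)) if n > 1 else 0.0
    mse = sum(e * e for e in errors) / n
    # MAPE requires a non-zero true RUL at every evaluated time.
    mape = 100.0 * sum(abs(e) / t for e, t in zip(errors, true_rul)) / n
    return {"bias": bias, "ssd": ssd, "mse": mse, "mape": mape}


# Hypothetical data: predictions every four weeks, EOL at week 64.
eol = 64.0
times = [32, 36, 40, 44, 48, 52, 56, 60]
predicted = [20.0, 18.0, 16.0, 13.0, 11.0, 10.0, 7.0, 4.0]
true = [eol - t for t in times]

# Assumed prediction horizon of 12 weeks: keep predictions from week 52 onward.
ph = 12.0
kept = [(tr, pr) for t, tr, pr in zip(times, true, predicted) if t >= eol - ph]

all_metrics = conventional_metrics(true, predicted)
within_ph = conventional_metrics([tr for tr, _ in kept], [pr for _, pr in kept])
print(all_metrics)
print(within_ph)
```

In this toy case the bias shrinks from -5.625 over all 
predictions to -1.0 within the horizon, which mirrors the 
qualitative change between the two ways of aggregating. 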


Table 2 - Performance evaluation for all four test 
algorithms for predictions made within prediction 
horizon with Error Bound = 5%. 

Metric            RVM       GPR       NN        PR
Bias              -1.19     -1.78     -1.53     0.22
SSD               1.18      1.33      1.45      3.33
MSE               2.03      3.96      3.27      7.75
MAPE              39.33     30.40     27.44     23.25
PH                8.46      12.46     12.46     24.46
RA (λ = 0.5)      0.77      0.62      0.69      0.95
CRA (λ = 0.5)     0.50      0.31      0.33      0.58
Convergence       3.76      4.44      4.61      7.36


Another aspect of performance evaluation is the 
requirement specification. As specifications change, the 
performance evaluation criteria also change. To illustrate 
this point, the prediction horizon was redefined with a 
relaxed error bound of 10%. As expected, prediction horizons 
become longer for most of the algorithms and hence more 
predictions are taken into account while computing the 
metrics. Table 3 shows the results with the new prediction 
horizons; the NN based approach now also seems to perform 
well on several criteria. This means that for applications 
where more relaxed requirements are acceptable, simpler 
approaches may be chosen if needed. 


Table 3 - Performance evaluation for all four test 
algorithms for predictions made within prediction 
horizon with Error Bound = 10%. 

Metric            RVM       GPR       NN        PR
Bias              -1.83     0.05      -1.53     0.22
SSD               1.73      4.34      1.45      3.33
MSE               5.02      10.6      3.27      7.75
MAPE              37.01     31.20     27.44     23.25
PH                12.46     16.46     12.46     24.46
RA (λ = 0.5)      0.76      0.79      0.69      0.95
CRA (λ = 0.5)     0.57      0.43      0.33      0.58
Convergence       5.49      3.43      4.61      7.36


Figure 7 shows the α-λ Accuracy metric for all four 
algorithms. Since all algorithms except GPR start predicting 
from week 32 onward, t_λ is determined to be around 48.3 
weeks. At that point only PR lies within the 80% accuracy 
levels. GPR starts predicting week 44 onward, i.e. its t_λ 
is determined to be around 54.3 weeks, where it seems to 
meet the requirements. This metric signifies whether a 
particular algorithm reaches a desired accuracy level 
halfway to the EOL from the point it starts predicting. 
Another aspect that may be of interest is whether an 
algorithm reaches the desired accuracy level some fixed 
time interval ahead of the EOL. In that case, for example, 
if t_λ is chosen as 48 weeks then GPR will not meet the 
requirement. Therefore, this metric may be modified to 
incorporate cases where not all algorithms may be able to 
start predicting at the same time. 
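
The evaluation instants quoted above can be reproduced with 
a short Python sketch. It assumes that t_λ is placed a 
fraction λ of the way from the first prediction time to 
EOL, and that a prediction passes when it lies within 
(1 ± α) of the true RUL. The EOL value of 64.6 weeks is not 
stated in this section; it is an assumption chosen here 
because it yields the 48.3 and 54.3 week figures, and the 
function names are illustrative. 

```python
def t_lambda(t_start: float, eol: float, lam: float) -> float:
    """Time instant a fraction `lam` of the way from t_start to EOL."""
    return t_start + lam * (eol - t_start)


def within_alpha_cone(predicted_rul: float, true_rul: float, alpha: float) -> bool:
    """True if the prediction lies within (1 +/- alpha) of the true RUL."""
    return (1.0 - alpha) * true_rul <= predicted_rul <= (1.0 + alpha) * true_rul


EOL = 64.6  # assumed value, in weeks (not given in the text)
LAM = 0.5
ALPHA = 0.2

t_most = t_lambda(32.0, EOL, LAM)  # algorithms that start at week 32
t_gpr = t_lambda(44.0, EOL, LAM)   # GPR, which starts at week 44
print(round(t_most, 1), round(t_gpr, 1))

# Hypothetical RUL predictions checked against the true RUL at t_most.
true_rul = EOL - t_most
print(within_alpha_cone(14.0, true_rul, ALPHA))
print(within_alpha_cone(12.0, true_rul, ALPHA))
```

Because predictions are only available every four weeks, a 
practical implementation would also need a rule, such as 
interpolation or taking the nearest prediction, for 
obtaining an estimate at t_λ; that choice is left open here. 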


[Plot: α-λ Accuracy (α = 0.2, λ = 0.5)] 

Figure 7 - The α-λ Accuracy metric determines whether 
predictions are within the cone of desired accuracy 
levels at a given time instant (t_λ). 

[Plot; x-axis: Time (weeks)] 

Figure 8 - Battery capacity decay profile shows several 
features that are difficult to learn using simple 
regression techniques. 

It can be observed from the results (Figure 7) that most 
algorithms fail to follow the trend towards the end. These 
approaches, being data-driven regression based techniques, 
find it difficult to learn the physical phenomenon by which 
batteries degrade. As shown in Figure 8, the battery 
capacity initially degrades quite fast, then the 
degradation rate slows down, and towards the end it slows 
down further. These algorithms are not able to learn this 
characteristic and predict an earlier EOL. 

Finally, we would like to mention a few key points that are 
important for performance evaluation and should be 
considered before choosing the metrics. Time scales 
observed in various prognostic algorithms are often very 
different in different applications. For instance, in 
battery health management time scales are on the order of 
weeks, whereas in other cases like electronics it may be a 
matter of hours or seconds. Therefore, the chosen metrics 
should acknowledge the importance of prediction horizon and 
weigh errors close to EOL with higher penalties. Next, these 
metrics may be modified to address asymmetric preference 
on RUL error. In most applications where a failure may lead 
to catastrophic outcomes, an early prediction is preferred 
over a late prediction. Finally, in the example discussed in 
this paper RUL estimates were obtained as a single value 
rather than as a RUL distribution for every prediction. The 
metrics presented in this paper can be applied to 
distribution-valued predictions with slight modifications. 
Similarly, for cases where multiple UUTs are available to 
provide data, minor adjustments will suffice. 
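
One possible way to encode both the higher penalty near EOL 
and the preference for early predictions is sketched below 
in Python. This is not one of the nine metrics evaluated in 
this paper; the function, the linear time weight t/EOL, and 
the factor of 2 applied to late predictions are hypothetical 
choices made purely for illustration. 

```python
from typing import Sequence


def weighted_asymmetric_error(
    times: Sequence[float],
    predicted_rul: Sequence[float],
    eol: float,
    late_penalty: float = 2.0,
) -> float:
    """Weighted mean absolute RUL error with two illustrative adjustments.

    - Late predictions (predicted RUL > true RUL) are scaled by `late_penalty`.
    - Each error is weighted by t / eol, so errors close to EOL count more.
    """
    total = 0.0
    weight_sum = 0.0
    for t, rul_hat in zip(times, predicted_rul):
        true_rul = eol - t
        error = rul_hat - true_rul  # > 0 means a late (optimistic) prediction
        cost = late_penalty * error if error > 0 else -error
        weight = t / eol
        total += weight * cost
        weight_sum += weight
    return total / weight_sum


eol = 64.0
times = [48.0, 56.0]  # true RULs are 16 and 8 weeks

early = weighted_asymmetric_error(times, [14.0, 6.0], eol)   # 2 weeks early
late = weighted_asymmetric_error(times, [18.0, 10.0], eol)   # 2 weeks late
far_from_eol = weighted_asymmetric_error(times, [12.0, 8.0], eol)
near_eol = weighted_asymmetric_error(times, [16.0, 4.0], eol)
print(early, late, far_from_eol, near_eol)
```

Errors of the same magnitude score 2.0 when early and 4.0 
when late, and an identical 4-week error costs more when it 
occurs at week 56 than at week 48. 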


8. Conclusion 

In this paper we have shown how performance metrics for 
prognostics can be designed. Four different prediction 
algorithms were used to show how various metrics convey 
different kinds of information. No single metric should be 
expected to cover all performance criteria. Depending on 
the requirements, a subset of these metrics should be chosen 
and a decision matrix should be used to rank different 
algorithms. In this paper we used nine metrics, including 
four conventional ones that are most commonly used to 
evaluate algorithm performance. The new metrics provide 
additional information that may be useful in comparing 
prognostic algorithms in particular. Specifically, these 
metrics track the evolution of prediction performance over 
time and help determine when these predictions can be 
considered trustworthy. Notions like convergence and 
prediction horizon, which have existed in the literature 
for a long time, have been quantified so that they can be 
used in an automated fashion. Further, new notions of 
performance measures at specific time instances have been 
instantiated using metrics like relative accuracy and α-λ 
performance. These metrics represent the notion that a 
prediction is useful only if it allows a certain amount of 
time to mitigate the predicted contingency. 
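
As a rough illustration of ranking with a decision matrix, 
the Python sketch below scores the four algorithms on a 
subset of the Table 1 metrics. The metric values are copied 
from Table 1; everything else is an assumption made for the 
example: the choice of subset, the min-max normalization, 
the equal weights, and the weighted-sum scoring rule, which 
is only one simple form a decision matrix could take. 

```python
from typing import Dict, List

ALGORITHMS = ["RVM", "GPR", "NN", "PR"]

# Values from Table 1 (all predictions, 5% error bound).
TABLE_1: Dict[str, List[float]] = {
    "MSE": [84.81, 184.16, 59.49, 17.35],
    "MAPE": [41.36, 53.93, 37.54, 23.05],
    "PH": [8.46, 12.46, 12.46, 24.46],
    "RA": [0.60, 0.86, 0.34, 0.82],
}

# Assumed direction of preference and (hypothetical) equal weights.
HIGHER_IS_BETTER = {"MSE": False, "MAPE": False, "PH": True, "RA": True}
WEIGHTS = {"MSE": 0.25, "MAPE": 0.25, "PH": 0.25, "RA": 0.25}


def decision_scores() -> Dict[str, float]:
    """Weighted sum of min-max normalized metrics; 1.0 is best on every metric."""
    scores = {name: 0.0 for name in ALGORITHMS}
    for metric, values in TABLE_1.items():
        lo, hi = min(values), max(values)
        for name, value in zip(ALGORITHMS, values):
            if HIGHER_IS_BETTER[metric]:
                normalized = (value - lo) / (hi - lo)
            else:
                normalized = (hi - value) / (hi - lo)
            scores[name] += WEIGHTS[metric] * normalized
    return scores


scores = decision_scores()
ranking = sorted(ALGORITHMS, key=lambda name: scores[name], reverse=True)
print(ranking)
```

Different weights or a different subset of metrics can 
change the order, which is the point of tying the choice to 
the requirements of the application. 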

Whereas these metrics demonstrate several ideas specific to 
prognostics performance evaluation, we by no means claim 
this list to be near perfect. It is anticipated that as new ideas 
are generated and the metrics themselves are evaluated in 
different applications, this list will be revised and refined 
before a standard methodology can be devised for 
evaluating prognostics. This paper is intended to serve as a 
start towards developing such metrics that can better 
encapsulate prognostic algorithm performance. 